How to Actually Evaluate AI Agents (Evals That Catch Regressions)

Most AI demos are lies — not maliciously, just by selection. You see the run that worked, not the nine that didn't. A cherry-picked demo is indistinguishable from a robust system right up until it's in production failing on inputs the demo never tried. Evals are the only way to tell the difference, and the difference is the whole job.

The mistake most teams make is treating evals as a dashboard they glance at once. The version that actually protects you treats them like regression tests: a fixed set, scored automatically, with a threshold that fails the build when quality drops. If you can't put a number on "did this change make it worse," you're running on vibes and calling it judgment.

The reframe: Evals aren't a report card you read after shipping. They're a gate you ship through. The goal isn't a pretty score; it's catching the regression before your users do.

1. Start With a Fixed Question Set

You can't measure drift without a baseline. The minimum viable eval is a curated set of 50–100 representative inputs with known-good expectations. Every meaningful change — a new model, a prompt edit, a different chunking strategy, a routing change — runs against that same set so the comparison is apples to apples. The set is an asset; grow it every time a real failure slips through (add the case that broke).
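A minimal sketch of what a fixed set could look like in code, assuming cases are kept as simple records. The names here (`EvalCase`, `fingerprint`, `add_regression_case`) and the sample questions are hypothetical, not taken from any repo mentioned in this post. The fingerprint lets two runs confirm they scored the same inputs.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalCase:
    """One entry in the fixed set (hypothetical schema)."""
    case_id: str
    question: str
    expected: str              # the known-good expectation
    source: str = "curated"    # "regression" marks cases added after a real failure


def fingerprint(cases):
    """Order-independent hash of the set, so runs can be compared like for like."""
    ordered = sorted(cases, key=lambda c: c.case_id)
    payload = json.dumps([asdict(c) for c in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def add_regression_case(cases, case_id, question, expected):
    """Return a new list with the failing case appended; IDs must stay unique."""
    if any(c.case_id == case_id for c in cases):
        raise ValueError(f"duplicate case_id: {case_id}")
    return cases + [EvalCase(case_id, question, expected, source="regression")]
```

Storing the fingerprint next to each run's scores makes it obvious when a comparison is no longer apples to apples because the set grew in between.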

2. Score With Metrics, Not Eyeballs

For retrieval and grounded answers, three metrics catch most failures. I run these via RAGAS in my RAG Knowledge Engine:

Faithfulness: is every claim in the answer grounded in the retrieved context? Catches hallucination.
Answer relevance: does the answer actually address the question? Catches confident off-topic responses.
Context recall: did the retriever surface the chunks needed to answer? Catches retrieval failures upstream of the model.
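To make the third metric concrete, here is a deliberately simplified, ID-based proxy for context recall. It is not the RAGAS implementation; it assumes each eval case is labeled with the chunk IDs needed to answer it, which is an assumption of this sketch. Faithfulness and answer relevance need a model in the loop and are not reproduced here.

```python
def context_recall(retrieved_ids, required_ids):
    """Fraction of the required chunks that the retriever actually surfaced.

    Simplified proxy: compares chunk IDs only and ignores ranking.
    """
    required = set(required_ids)
    if not required:
        # Nothing was needed, so nothing could be missed.
        return 1.0
    return len(required & set(retrieved_ids)) / len(required)
```

For a case that needs chunks `a`, `b`, and `c`, a retriever that returns `a`, `b`, and an unrelated `x` scores two out of three: the failure is upstream of the model, whatever the answer looks like.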

For open-ended agent output where there's no exact string to match, an LLM-as-judge scores the response against a rubric. It's not perfect, but a consistent judge applied to a fixed set reliably surfaces relative regressions — which is what you care about between versions.
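One way to sketch the judge loop, assuming the model call is injected as a plain function from prompt to reply text. The rubric wording, the `SCORE: <n>` reply format, and all function names are hypothetical choices for this example; no particular provider API is implied.

```python
import re
from statistics import mean

# Hypothetical rubric; the exact wording is an assumption of this sketch.
RUBRIC = (
    "Score the RESPONSE to the TASK from 1 to 5.\n"
    "5 = fully correct and complete, 3 = partially correct, 1 = wrong or off-topic.\n"
    "Reply with the line 'SCORE: <n>' and nothing else."
)


def build_prompt(task, response):
    """Same rubric and layout on every call, so scores stay comparable across runs."""
    return f"{RUBRIC}\n\nTASK:\n{task}\n\nRESPONSE:\n{response}\n"


def parse_score(reply):
    """Extract the 1-5 score; fail loudly instead of guessing on a malformed reply."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group(1))


def judge_set(cases, judge):
    """Mean rubric score over (task, response) pairs; `judge` maps prompt -> reply."""
    return mean(parse_score(judge(build_prompt(task, resp))) for task, resp in cases)
```

Keeping the judge behind a function argument means the harness can be tested with a stub, and the real judge model can be pinned so that a score change reflects the system under test and not the grader.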

3. Test the Pieces, Not Just the Whole

A multi-agent system fails at the seams. Evaluate each agent's contract in isolation before you judge the pipeline. The agents in my agentic-systems repo show the unit-level version: the test-case generator is checked on whether it produces valid, runnable tests from a function signature; the code reviewer on whether it catches seeded bugs. When a pipeline regresses, component evals tell you which stage moved — far faster than staring at end-to-end output.
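A rough sketch of the component-level idea, under the assumption that a reviewer agent can be wrapped as a function returning the set of bug labels it flagged. The function names, the label scheme, and the 0.02 tolerance are all hypothetical and are not taken from the agentic-systems repo.

```python
def seeded_bug_catch_rate(reviewer, seeded):
    """Share of seeded bugs the reviewer flags.

    seeded:   list of (code_snippet, bug_label) pairs with one planted bug each.
    reviewer: callable taking a snippet and returning a set of flagged labels.
    """
    if not seeded:
        raise ValueError("need at least one seeded bug")
    caught = sum(1 for code, label in seeded if label in reviewer(code))
    return caught / len(seeded)


def stages_that_moved(baseline, current, tolerance=0.02):
    """Names of components whose score dropped by more than `tolerance`.

    A stage missing from `current` is treated as scoring 0.0.
    """
    return sorted(
        name for name, base in baseline.items()
        if base - current.get(name, 0.0) > tolerance
    )
```

With per-stage scores recorded for each run, a call like `stages_that_moved(baseline_scores, current_scores)` points at the stage to inspect first, before anyone reads end-to-end transcripts.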

Fixed set: 50–100 cases, same inputs every run; the baseline you diff against
Metrics: faithfulness, answer relevance, context recall, plus LLM-as-judge
Threshold: below the line, the build fails; evals as a gate, not a chart

4. Wire It Into CI

An eval you have to remember to run is an eval you won't run. Put it in the pipeline: on every significant change, the suite runs against the fixed set, and if faithfulness drops below your threshold (I use 0.85 for grounded Q&A), the change doesn't merge. That single rule converts "we think it's better" into "the numbers say it's not worse" — which is the only honest way to ship changes to a probabilistic system.
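The gate itself can be very small. Below is a sketch that assumes the eval run has already produced one faithfulness score per case and that the gate compares their mean against the 0.85 threshold mentioned above. Averaging is an assumption of this sketch (a per-case minimum would be stricter), and the function names are hypothetical.

```python
from statistics import mean

FAITHFULNESS_THRESHOLD = 0.85  # the value quoted above for grounded Q&A


def gate(scores, threshold=FAITHFULNESS_THRESHOLD):
    """Return (passed, mean_score) for a list of per-case faithfulness scores."""
    if not scores:
        # An empty run must never pass silently.
        raise ValueError("no scores: refusing to pass an empty eval run")
    avg = mean(scores)
    return avg >= threshold, avg


def main(scores):
    """Print a one-line verdict and return a process exit code (0 = merge allowed)."""
    passed, avg = gate(scores)
    verdict = "PASS" if passed else "FAIL"
    print(f"faithfulness mean={avg:.3f} threshold={FAITHFULNESS_THRESHOLD:.2f} {verdict}")
    return 0 if passed else 1
```

In a CI job, the script would end with `sys.exit(main(scores))`, so a non-zero exit code blocks the merge in the same way a failing unit test does.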

The Throughline

Evals complete the reliability triad, alongside memory (stop repeating failures) and guardrails (stop drifting). Memory fixes the past, guardrails fix the present, evals protect the future. None of them demo well. All of them are why a system survives contact with real users, which is the argument I make in full in "The boring infrastructure that actually ships."

What I Built

RAGAS-style scoring lives in rag-knowledge-engine; component-level evals across five topologies are in agentic-systems. For the retrieval side of the story, see RAG in production.